Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

discovery for RNA-seq sequencing count data

With the significant cost reduction of sequencing technology,
discovering gene expression patterns or detecting differentially expressed
genes (DEGs) based on sequencing count data has become increasingly popular in
biological/medical research. The sequencing count data are normally
generated using a next-generation sequencer such as the so-called
… machine [Behjati and Tarpey, 2013; Forde and O’Toole, 2013;
…, et al., 2014]. A next-generation sequencer generates short
sequences, which are normally 100 base pairs long or shorter. Such a short
sequence is called a sequencing read. One of the major applications of the
sequencing count data is the transcriptome data for gene differential
expression pattern discovery [Crowgey, et al., 2020; Goswami and
…, 2020]. Therefore, the majority of the sequencing count data used
for discovering differentially expressed genes is called the RNA-seq count
data. It must be noted that an individual sequencing read has no biological
meaning. The biological meaning of sequencing reads can be investigated
only when they have been aligned or mapped to a reference genome. There
are several packages for mapping or aligning collected sequencing reads to a
reference genome, such as BWA [Li and Durbin, 2009] and
… [Langmead, et al., 2009], etc. An active gene may attract a greater
number of sequencing reads while an inactive gene may attract few or
no sequencing reads. Only after sequencing reads have been mapped
to a reference genome is it possible to assess whether a
gene is more active or less active than other genes depending on the
number of sequencing reads which hit the gene. The direct outcome of the
mapping from sequencing reads to a reference genome is a sequencing
count matrix across genes and replicates. Only after such a matrix has been
generated, discovering DEGs based on sequencing count data can then